The following R Markdown report walks through a static version of my R Shiny app, Destination Birding. The app and this report were completed as part of my capstone project for BrainStation's Data Science Bootcamp in June 2023.
A little bit about me: I have a background in environmental science and several years of GIS experience. I'm a Bird Nerd, so I created this app for fellow birders in the Knoxville area!
These are the libraries needed to run the report.
# load libraries
library(tidyverse)
library(dplyr)
library(sf)
library(lubridate) # for dates
library(leaflet)
library(ggplot2)
library(auk) # for ebird
library(reticulate) # for python
library(dbscan)
library(forecast)
library(plotly)
library(zoo) # for yearmon object
My R Shiny app ended up being written entirely in R, but during the creation process I frequently went back and forth between R and Python. These are the necessary Python imports; a small sketch of how reticulate shares objects between the two sessions follows them.
import numpy as np
import pandas as pd
#for plotting
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns
# stats
from statsmodels.api import tsa # time series analysis
import statsmodels.api as sm
#Time Series Analysis
from sklearn.metrics import mean_absolute_error
import matplotlib.dates as mdates
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
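For context, R Markdown lets R and Python chunks share objects through reticulate. Here is a minimal, illustrative sketch of that round trip; the object names df_r, df_py, and df_doubled are hypothetical and not from the app.
library(reticulate)
# An R data frame, pushed into the Python session
df_r <- data.frame(x = 1:3)
py$df_py <- r_to_py(df_r) # visible in Python chunks as `df_py`
# Work on it in Python, then pull the result back into R via py$<name>
py_run_string("df_doubled = df_py.assign(x = df_py.x * 2)")
df_back <- py$df_doubled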
My data came from eBird, an online database of bird observations providing researchers and amateur naturalists with real-time data about bird distribution and abundance. I have been using the eBird app for several years and have contributed almost 200 checklists covering over 500 species. I put in a request with eBird to access seven years of data for Knox County, Tennessee. eBird provides its own R package, auk, for extracting the data from their file. The process is shown below.
# path to the eBird data file
f <- "data/ebird_7/ebd_US-TN-093_201601_202212_relApr-2023.txt"
f_out <- "ebd_filtered.txt"
# create the data frame (auk_ebd() defines the dataset, auk_filter() runs the
# extraction, and read_ebd() reads the result into a tibble)
knox_birds <- f %>%
  auk_ebd() %>%
  # auk_species() %>%
  auk_filter(file = f_out, overwrite = TRUE) %>%
  read_ebd()
There are many other specifications that can be included in the code above, for example restricting the extraction to certain species or countries. More information can be found on the auk GitHub site. For my purposes, I wanted all observations, but a sketch of a more targeted extraction follows.
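As an illustrative sketch only (I did not apply these filters), a more targeted extraction might look like the following; the species and date range here are hypothetical.
# hedged sketch: restrict the extraction to two species and a single year
f %>%
  auk_ebd() %>%
  auk_species(species = c("Carolina Wren", "American Robin")) %>% # keep only these species
  auk_date(date = c("2020-01-01", "2020-12-31")) %>% # keep only 2020 observations
  auk_filter(file = "ebd_targeted.txt", overwrite = TRUE) %>%
  read_ebd()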
Next, I created a data frame with only the relevant columns for my analysis. In eBird, if a species is only heard but the observer didn't see it, they can mark it "heard, not seen." This records an X for observation_count, which becomes an NA when the column is converted to numeric. I decided to fill these with ones for the time being. Another interesting approach would be to use Funk Singular Value Decomposition to estimate expected values for the NAs (a rough sketch of that idea follows).
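Here is a rough, hypothetical sketch of that idea (not used in the app): Funk-style matrix factorization, as popularized during the Netflix Prize, fitted by stochastic gradient descent on the observed cells only. Rows could be checklists, columns species, and entries counts; the function name and hyperparameters are illustrative.
# minimal sketch of Funk-style matrix factorization for imputing missing counts
funk_svd <- function(M, k = 2, lr = 0.01, reg = 0.02, epochs = 200) {
  obs <- which(!is.na(M), arr.ind = TRUE) # indices of observed cells
  U <- matrix(runif(nrow(M) * k, 0, 0.1), nrow(M), k) # latent row factors
  V <- matrix(runif(ncol(M) * k, 0, 0.1), ncol(M), k) # latent column factors
  for (e in seq_len(epochs)) {
    for (r in seq_len(nrow(obs))) {
      i <- obs[r, 1]; j <- obs[r, 2]
      err <- M[i, j] - sum(U[i, ] * V[j, ]) # residual on one observed cell
      u_old <- U[i, ]
      U[i, ] <- U[i, ] + lr * (err * V[j, ] - reg * U[i, ])
      V[j, ] <- V[j, ] + lr * (err * u_old - reg * V[j, ])
    }
  }
  U %*% t(V) # dense reconstruction; missing cells receive estimates
}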
knox_birds <- read_csv("data/knox_birds.csv")
# select only relevant columns
knox_birds_reduced <- knox_birds %>%
  dplyr::select(c(checklist_id,
                  common_name,
                  scientific_name,
                  observation_count,
                  locality,
                  latitude,
                  longitude,
                  observation_date)) %>%
  mutate(observation_count = as.numeric(observation_count)) %>% # convert to numeric; the "X" values become NA
  mutate(observation_count = replace_na(observation_count, 1)) # fill NAs created by "heard, not seen" with 1
# View the first few rows
head(knox_birds_reduced, 5)
## # A tibble: 5 × 8
## checklist_id common_name scientific_name observation_count locality latitude
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 S27110528 American Bla… Anas rubripes 1 Dead Ho… 35.9
## 2 S27048704 American Bla… Anas rubripes 1 Dead Ho… 35.9
## 3 S26863871 American Crow Corvus brachyr… 1 The Nut… 35.9
## 4 S26615366 American Crow Corvus brachyr… 3 Victor … 36.0
## 5 S27056450 American Crow Corvus brachyr… 1 Home 35.9
## # ℹ 2 more variables: longitude <dbl>, observation_date <date>
# group observations by date and species name to get a daily observation count of each species
daily_counts <- knox_birds_reduced %>%
  group_by(observation_date, common_name) %>%
  summarise(observation_count = sum(observation_count))
# View the first few rows
head(daily_counts, 5)
## # A tibble: 5 × 3
## # Groups: observation_date [1]
## observation_date common_name observation_count
## <date> <chr> <dbl>
## 1 2016-01-01 American Coot 80
## 2 2016-01-01 American Crow 98
## 3 2016-01-01 American Goldfinch 85
## 4 2016-01-01 American Robin 197
## 5 2016-01-01 American Woodcock 3
# export csv files to read in to python
#write.csv(knox_birds_reduced, "data/knox_birds_reduced.csv")
#write.csv(daily_counts, "data/daily_counts.csv")
Above, I exported my data frames. Below, they're read into pandas data frames to proceed with the Time Series Analysis.
knox_birds = pd.read_csv("data/knox_birds.csv")
## <string>:1: DtypeWarning:
##
## Columns (42) have mixed types. Specify dtype option on import or set low_memory=False.
knox_birds_reduced = pd.read_csv("data/knox_birds_reduced.csv")
daily_counts = pd.read_csv("data/daily_counts.csv")
#drop extra column from dfs
knox_birds_reduced.drop(columns = "Unnamed: 0", inplace = True)
daily_counts.drop(columns = "Unnamed: 0", inplace = True)
# show the first 5 rows of each data frame
knox_birds_reduced.head(5)
## checklist_id common_name ... longitude observation_date
## 0 S27110528 American Black Duck ... -84.108444 2016-01-24
## 1 S27048704 American Black Duck ... -84.108444 2016-01-22
## 2 S26863871 American Crow ... -83.864839 2016-01-12
## 3 S26615366 American Crow ... -83.998001 2016-01-02
## 4 S27056450 American Crow ... -84.211063 2016-01-22
##
## [5 rows x 8 columns]
daily_counts.head(5)
## observation_date common_name observation_count
## 0 2016-01-01 American Coot 80
## 1 2016-01-01 American Crow 98
## 2 2016-01-01 American Goldfinch 85
## 3 2016-01-01 American Robin 197
## 4 2016-01-01 American Woodcock 3
A Time Series Analysis refers to the statistical modeling, forecasting, and interpretation of data collected over time. It involves analyzing the patterns, trends, and dependencies present in the data to gain insights, make predictions, and inform decision-making.
To start, I did some data type conversions and checked whether any dates were missing.
#convert date to datetime
knox_birds_reduced["observation_date"] = pd.to_datetime(knox_birds_reduced["observation_date"])
# get first and last day in the series
first_day = knox_birds_reduced.observation_date.min()
last_day = knox_birds_reduced.observation_date.max()
# pandas `Timestamp` objects
first_day, last_day
# how many total days?
## (Timestamp('2016-01-01 00:00:00'), Timestamp('2022-12-31 00:00:00'))
last_day - first_day
# are we missing any days?
## Timedelta('2556 days 00:00:00')
full_range = pd.date_range(start=first_day, end=last_day, freq="D") # every possible day
# which dates in the full range never appear in the data?
full_range.difference(knox_birds_reduced.observation_date)
## DatetimeIndex(['2017-07-21'], dtype='datetime64[ns]', freq=None)
Only one date is missing: 2017-07-21. That means on all but one day from 2016 through 2022, at least one person submitted an eBird checklist!
The first time series analysis I did was on species diversity. This did not end up in the R Shiny app, but it provides some interesting insights. First, I make sure the daily counts observation_date column is a datetime.
#convert date to datetime
daily_counts["observation_date"] = pd.to_datetime(daily_counts["observation_date"])
Then I get the number of unique occurrences of common_name (the number of distinct species) and rename the column for clarity.
# group by observation date and count the number of unique common names. The index is now the observation date
daily_diversity = pd.DataFrame(daily_counts.groupby('observation_date')['common_name'].nunique())
# rename columns to species_num
daily_diversity = daily_diversity.rename(columns={'common_name': 'species_num'})
# view the df
daily_diversity.head(10)
## species_num
## observation_date
## 2016-01-01 70
## 2016-01-02 69
## 2016-01-03 54
## 2016-01-04 43
## 2016-01-05 64
## 2016-01-06 61
## 2016-01-07 59
## 2016-01-08 41
## 2016-01-09 53
## 2016-01-10 68
I can double-check the missing-date count by comparing the shape to the length of the full date range: the range spans 2,557 days, so with one missing date I expect 2,556 rows.
# check shape
daily_diversity.shape
## (2556, 1)
Let's see some graphs! For clarity, these steps are normally separated when working in other Python environments. In R Markdown, however, I have to chain them together so the chunk doesn't output multiple intermediate graphs.
#plot species diversity
fig1 = px.line(daily_diversity, x=daily_diversity.index, y=daily_diversity.columns)
# activate slider
#fig1.update_xaxes(rangeslider_visible=True)
# Add axis titles
fig1.update_layout(
xaxis_title="Date",
yaxis_title="Number of Species",
title="Daily Species Diveristy in Knox County from 2016-2022"
).update_xaxes(rangeslider_visible=True) # activate slider
#fig1.show()
It's hard to make much sense of this graph. A fundamental step in time series EDA is the trend-seasonal decomposition. Here, we extract three series from our original observations:

- a trend component \(T_t\) calculated using a moving average,
- a seasonal component \(S_t\), the monthly/daily average of the de-trended series, and
- the residual \(R_t\) that remains after subtracting the trend and seasonal components from the original series.
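For the additive model used below, these components simply sum back to the observed series:

\[ Y_t = T_t + S_t + R_t \]

where \(Y_t\) denotes the observed value (here, the monthly mean species count) at time \(t\).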
# resample is like groupby but for time series
# the "MS" option specifies Monthly frequency by Start day
daily_diversity_monthly = daily_diversity.resample("MS").mean() # average number of species for each month
# decompose the time series
decomposition = tsa.seasonal_decompose(daily_diversity_monthly, model='additive')
# add the decomposition data
daily_diversity_monthly["Trend"] = decomposition.trend
daily_diversity_monthly["Seasonal"] = decomposition.seasonal
daily_diversity_monthly["Residual"] = decomposition.resid
# make a plot
cols = ["Trend", "Seasonal", "Residual"]
fig2 = make_subplots(rows=3, cols=1, subplot_titles=cols)
for i, col in enumerate(cols):
    fig2.add_trace(
        go.Scatter(x=daily_diversity_monthly.index, y=daily_diversity_monthly[col]),
        row=i+1,
        col=1
    ).update_layout(showlegend=False)
#fig2.show()
We can make a few observations immediately:
The trend is clearly upward: species diversity plateaued from 2016 to 2019, followed by a steady increase. This lines up with COVID-19 in 2020, when many people took up birding. In other words, species diversity likely didn't actually increase in 2020; more people were simply outside birding and reporting what they saw.
The seasonal plot shows peaks in the spring, consistent with migration, when birds pass through the area on their way to their breeding grounds.
As a quick detour, below is the monthly average number of checklists submitted to eBird. It suggests my hypothesis, that species diversity increased as a function of more people birding, is a reasonable one.
# get the number of checklists submitted per day
daily_checklists = pd.DataFrame(knox_birds.groupby('observation_date')['checklist_id'].nunique())
# change the index (date) column to a datetime
daily_checklists.index = pd.to_datetime(daily_checklists.index)
# get the monthly average number of checklists
monthly_checklists = daily_checklists.resample('M').mean().round(0)
monthly_checklists.columns = ['avg_num_checklists']
# define the figure
fig3 = px.line(monthly_checklists, x=monthly_checklists.index, y=monthly_checklists.columns)
# Add axis titles
fig3.update_layout(
xaxis_title="Date",
yaxis_title="Number of Checklists",
title="Monthly Average Number of Checklists in Knox County from 2016-2022",
showlegend = False,
).update_xaxes(rangeslider_visible=True) # activate slider
The next step would be forecasting, but instead I will return to following the R Shiny app. My point in showing this part of my EDA was to share an interesting insight: the increase in species diversity seen in Knoxville starting in 2020 appears to be a function of more people getting into birding during COVID.
The Time Series Analysis tab in Destination Birding can be used to predict future observations of species of interest. The tab is interactive and will show the analysis for the species selected in the dropdown. In this report, however, I will hard-code a species for illustration purposes. Initially, I used the Python statsmodels package, but in order to maximize functionality in the R Shiny app, I switched to R's forecast and stats packages.
I’ll start by choosing a bird species and extracting its observations from the daily counts dataframe.
target_species = "Carolina Wren"
# Filter the bird observation data for the target species
target_species_df <- daily_counts %>%
filter(common_name == target_species)
head(target_species_df, 5)
## # A tibble: 5 × 3
## # Groups: observation_date [5]
## observation_date common_name observation_count
## <date> <chr> <dbl>
## 1 2016-01-01 Carolina Wren 28
## 2 2016-01-02 Carolina Wren 61
## 3 2016-01-03 Carolina Wren 30
## 4 2016-01-04 Carolina Wren 10
## 5 2016-01-05 Carolina Wren 23
Next, I'll derive a month_year column from the observation_date column, since the ultimate goal is to group observations by month.
# Convert the observation date to month-year string format
target_species_df$month_year <- format(target_species_df$observation_date, "%Y-%m")
# Group the data by month and average the counts
target_species_month <- target_species_df %>%
group_by(month_year) %>%
summarise(observation_count = mean(observation_count))
Now I’ve grouped the observations of the target species, Carolina Wren, by month and averaged the number of observations.
head(target_species_month, 10)
## # A tibble: 10 × 2
## month_year observation_count
## <chr> <dbl>
## 1 2016-01 21.6
## 2 2016-02 19.3
## 3 2016-03 22.6
## 4 2016-04 33.5
## 5 2016-05 19.5
## 6 2016-06 13.4
## 7 2016-07 12.6
## 8 2016-08 13.9
## 9 2016-09 20.0
## 10 2016-10 20.2
If you look closely, you'll notice that there are already missing months in the series. In order to continue with the time series analysis, the dates must be sequential, so here I fill in the missing months and set their observation counts to 0. This is another instance where Funk Singular Value Decomposition could be useful for filling in missing values; it is something I plan to explore as time allows.
# Convert month_year to yearmon object
target_species_month$month_year <- as.yearmon(target_species_month$month_year, format = "%Y-%m")
# Convert yearmon to numeric representation for sequence generation
target_species_month$month_year_numeric <- as.numeric(target_species_month$month_year)
# Complete the target_species_month data frame with missing months and set observation_count to 0
target_species_month <- target_species_month %>%
  complete(month_year_numeric = seq(min(target_species_month$month_year_numeric),
                                    max(target_species_month$month_year_numeric),
                                    by = 1/12)) %>%
  mutate(month_year = as.yearmon(month_year_numeric),
         observation_count = replace(observation_count, is.na(observation_count), 0)) %>%
  select(-month_year_numeric)
I then extended the data frame past the end of my dataset, so the predictions can run through 2023.
# Extend the target_species_month data frame with future months and set observation_count to NA
target_species_month <- target_species_month %>%
  complete(month_year = seq(min(target_species_month$month_year),
                            as.yearmon("2023-12"), by = 1/12),
           fill = list(observation_count = NA))
Now the data is ready to be split into training and test sets. The model will be trained on the data from 2016 through the end of 2021; the test data is the year 2022. The predictions will overlap with the test year (2022) and extend one more year, through the end of 2023.
# Time series data for training and testing
train <- target_species_month[target_species_month$month_year < as.yearmon("2022-01"), "observation_count"]
test <- target_species_month[target_species_month$month_year >= as.yearmon("2022-01"), "observation_count"]
Time for some modeling! I'll first create time series objects and then fit a SARIMA model to the training data. More information about SARIMA models can be found here.
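For reference, in standard notation (not specific to this app), a seasonal ARIMA(p,d,q)(P,D,Q)[s] model with backshift operator \(B\) can be written as

\[ \phi_p(B)\,\Phi_P(B^s)\,(1-B)^d\,(1-B^s)^D\,y_t = \theta_q(B)\,\Theta_Q(B^s)\,\varepsilon_t \]

where \(\phi_p\) and \(\theta_q\) are the non-seasonal AR and MA polynomials, \(\Phi_P\) and \(\Theta_Q\) are their seasonal counterparts, and \(s = 12\) for monthly data. auto.arima() below searches over these orders within the limits I specify.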
# create time series objects
train_ts <- ts(train, frequency = 12)
test_ts <- ts(test, frequency = 12)
# Fit the SARIMA model (the arguments bound auto.arima's search space)
model <- auto.arima(train_ts,
                    d = 0,
                    D = 1,
                    max.p = 5,
                    max.q = 5,
                    max.P = 2,
                    max.Q = 2,
                    max.order = 5,
                    max.d = 2,
                    max.D = 2,
                    start.p = 1,
                    start.q = 0,
                    start.P = 0,
                    start.Q = 0)
# Make predictions
predictions <- forecast(model, h = length(test_ts))
# Extract the point forecasts from the predictions
predictions_values <- as.vector(predictions$mean)
# View model summary to see what parameters it chose and evaluation metrics
summary(model)
## Series: train_ts
## ARIMA(1,0,0)(0,1,1)[12] with drift
##
## Coefficients:
## ar1 sma1 drift
## 0.5440 -0.5406 0.4337
## s.e. 0.1081 0.1611 0.0815
##
## sigma^2 = 35.56: log likelihood = -192.98
## AIC=393.96 AICc=394.69 BIC=402.34
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.3063077 5.306093 3.661207 -5.468425 13.50995 0.5689306
## ACF1
## Training set -0.01670076
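The summary above reports training-set error only. As a quick, illustrative check that is not part of the original app, the point forecasts can be scored against the held-out 2022 observations; only the first 12 forecast months have actual values, since 2023 is unobserved.
# compare the 2022 forecasts to the 2022 actuals (the 2023 months are NA by design)
test_actual <- head(as.vector(test_ts), 12)
test_pred <- head(predictions_values, 12)
mean(abs(test_actual - test_pred)) # mean absolute error on the test year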
The training data (the first six years) has been run through the model. The model produced predicted values that we will now visually compare to the actual values from the test data.
# Filter the target_species_month data for train and test sets
train_dates <- target_species_month$month_year[target_species_month$month_year < as.yearmon("2022-01")]
test_dates <- target_species_month$month_year[target_species_month$month_year >= as.yearmon("2022-01")]
#Filter the observation counts for train and test sets
train_counts <- target_species_month$observation_count[target_species_month$month_year < as.yearmon("2022-01")]
test_counts <- target_species_month$observation_count[target_species_month$month_year >= as.yearmon("2022-01")]
# Round counts for the plotly tooltip
train_counts <- round(train_counts, 0)
test_counts <- round(test_counts, 0)
predictions_values <- round(predictions_values, 0)
# Convert to date objects for plotly graph
date_test <- as.Date(paste(test_dates, "01"), format = "%b %Y %d")
date_train <- as.Date(paste(train_dates, "01"), format = "%b %Y %d")
# Create the plot
time_series_plot <- plot_ly() %>%
  add_lines(data = data.frame(x = date_train, y = train_counts),
            x = ~x, y = ~y, color = I("#C7C7C7"),
            line = list(dash = "solid"),
            name = "Actual (Train)") %>%
  add_lines(data = data.frame(x = date_test, y = test_counts),
            x = ~x, y = ~y, color = I("#97D8ED"),
            line = list(dash = "solid"),
            alpha = 0.8,
            name = "Actual (Test)") %>%
  add_lines(data = data.frame(x = date_test, y = predictions_values),
            x = ~x, y = ~y, color = I("#4B8FA6"),
            line = list(dash = "dashdot"),
            alpha = 0.8,
            name = "Predicted") %>%
  layout(xaxis = list(title = "Date"),
         yaxis = list(title = "Observation Count"),
         title = paste("Time Series Analysis for", target_species),
         showlegend = TRUE) %>%
  rangeslider()
# View the plot
time_series_plot
Upon visual inspection, we can see the model did a good job of predicting the number of Carolina Wrens seen in Knox County in 2022! Carolina Wrens are a common bird, and I suspect their increase in observations is also a result of more people birding rather than a true population increase.
The Species Diversity Map tab lets you view species diversity for any given month. Perhaps you've never visited Knoxville or the East Coast, and while you're here you want to go to the spots with the greatest number of bird species to add to your life list. This map can help!
I’ll start by reading in the data and creating a month column. Again, for illustration purposes, I will hard code a month. The website version has full interactivity.
# read in csv
bird_data <- read.csv("data/ebird_7_reduced.csv")
# ensure date type
bird_data$observation_date <- as.Date(bird_data$observation_date)
#create a month column
bird_data$month <- format(bird_data$observation_date, "%m")
# filter for observations only in the month of June
filtered_data <- bird_data %>%
filter(month == "06")
There were around 7,000 unique locations recorded in Knoxville. In order to make use of the data, it's best to group nearby points. I decided to group them into 30 locations using K-means, a clustering algorithm, and then add the cluster IDs back to the filtered data frame.
# Perform clustering (K-means is stochastic; calling set.seed() beforehand
# would make the cluster assignments reproducible)
clusters <- kmeans(filtered_data[, c("latitude", "longitude")], centers = 30)
# Add cluster labels to the filtered data frame
filtered_data$cluster <- clusters$cluster
# View the df
head(filtered_data,5)
## X checklist_id common_name scientific_name observation_count
## 1 34995 S30251718 Acadian Flycatcher Empidonax virescens 3
## 2 34996 S30600424 American Crow Corvus brachyrhynchos 1
## 3 34997 S30083320 American Crow Corvus brachyrhynchos 2
## 4 34998 S30283292 American Crow Corvus brachyrhynchos 2
## 5 34999 S30159423 American Crow Corvus brachyrhynchos 1
## locality latitude longitude observation_date
## 1 William Hastie Natural Area 35.93226 -83.87634 2016-06-16
## 2 Lyons Bend/Cove Island (private) 35.87893 -83.97420 2016-06-24
## 3 717 Fowler Ford Road Portland Tennessee 35.91696 -84.24122 2016-06-05
## 4 Third Creek Greenway 35.94841 -83.96605 2016-06-18
## 5 Forks of the River WMA 35.95680 -83.84849 2016-06-10
## month cluster
## 1 06 1
## 2 06 20
## 3 06 18
## 4 06 29
## 5 06 12
The data frame now has 30 numbered clusters, so I'll group by the cluster column and, for each cluster, count the number of unique species.
On eBird, the user can name their location or use an already established one. This can make naming the locality tricky. Ultimately, the point on the map should reflect the name of the place where people actually go to bird. For example, Seven Islands State Birding Park is a popular spot to look for birds. Some people use the full name, some call it "7 islands", "park", etc. My approach was to go with the masses: I look for the most frequent locality name in the cluster and use that.
# Aggregate data by cluster and common name
agg_data <- filtered_data %>%
  add_count(cluster, locality) %>% # add a column n with counts of the locality names at each cluster
  group_by(cluster) %>%
  mutate(Majority = locality[n == max(n)][1]) %>% # the most frequent locality name per cluster
  select(-n) %>% # drop the temporary variable
  summarize(distinct_common_names = n_distinct(common_name), # number of distinct species per cluster
            latitude = mean(latitude),
            longitude = mean(longitude),
            locality = first(Majority)) %>% # name the cluster with its most frequent locality name
  ungroup()
Next I defined a color palette, formatted the tooltip (the hovering
text box), and created the map using leaflet.
# Define a color palette for the bubbles
color_palette <- colorNumeric(
  palette = "Blues",
  domain = agg_data$distinct_common_names
)
# Prepare the text for the tooltip
mytext <- paste(
  "Location: ", agg_data$locality, "<br/>",
  "Number of Species: ", agg_data$distinct_common_names, "<br/>",
  sep = ""
) %>%
  lapply(htmltools::HTML)
# Create proportional symbol map
leaflet(agg_data) %>%
  addTiles() %>%
  setView(lng = mean(filtered_data$longitude), # default view lat & long
          lat = mean(filtered_data$latitude),
          zoom = 10) %>%
  addCircleMarkers(
    lng = ~longitude,
    lat = ~latitude,
    weight = 1,
    color = ~color_palette(distinct_common_names),
    radius = ~distinct_common_names / 10, # size the circles based on count
    fillOpacity = 0.7,
    label = mytext)
The final part of the web app shows both a map and a bar chart. The bar chart shows the sum of observations for the species of interest, and the map shows where those observations occurred. The bar chart reacts only to the species input, while the map needs both the species and month(s) to be selected. The bar chart is meant to be used as a reference: with a quick glance, it's apparent when a certain species is most frequently observed in Knox County. I used the sum instead of the average because I found the average distorted the data. I'll give an example below using the American Black Duck.
# bar chart data
filtered_bar_data <- bird_data %>%
  filter(common_name == "American Black Duck")
# group by month and add up total observations
monthly_sum_counts <- filtered_bar_data %>%
  group_by(month) %>%
  summarise(monthly_sum = round(sum(observation_count, na.rm = TRUE))) %>%
  mutate(Month = factor(month, levels = unique(month)))
# group by month and average observations
monthly_avg_counts <- filtered_bar_data %>%
  group_by(month) %>%
  summarise(monthly_avg = round(mean(observation_count, na.rm = TRUE))) %>%
  mutate(Month = factor(month, levels = unique(month)))
# Create the total sum bar chart using Plotly
plot_ly(monthly_sum_counts,
        x = ~Month,
        y = ~monthly_sum,
        type = "bar",
        marker = list(color = "#4B8FA6")) %>%
  layout(xaxis = list(title = "Month"),
         yaxis = list(title = "Observations"),
         title = "Total Number of Observations")
# Create the avg bar chart using Plotly
plot_ly(monthly_avg_counts,
        x = ~Month,
        y = ~monthly_avg,
        type = "bar",
        marker = list(color = "#4B8FA6")) %>%
  layout(xaxis = list(title = "Month"),
         yaxis = list(title = "Observations"),
         title = "Average Number of Observations")
These two graphs are almost complete opposites. The average monthly counts graph would lead users to think that American Black Ducks are most likely to be in Knoxville in May. Ducks usually breed far to the north during the summer months, so what's going on here?
There is actually only one observation (checklist) for the entire seven years in the month of May! One person saw 4 of these ducks on May 8th, 2016. If I wanted to use averages, I'd need to divide the monthly observation counts by 7 (since there are seven years of data). However, this would make a lot of numbers very small and almost meaningless. Instead, I decided to sum all of the observations across the seven years. The point of the graph is to show people when they're most likely to see the bird of interest, not how many of that bird they'll see when they go looking for it. Returning to the total sum graph, we can see the winter months are when American Black Ducks are most frequently seen in Knoxville.
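As an illustrative sanity check of that May anomaly (not part of the app), the May checklists can be counted directly:
# hedged sketch: how many distinct May checklists recorded this species,
# and how many individual birds in total?
filtered_bar_data %>%
  filter(month == "05") %>%
  summarise(checklists = n_distinct(checklist_id),
            total_birds = sum(observation_count, na.rm = TRUE))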
As a birder myself, this is something I could have told you from knowing about the bird’s life history and migration patterns, but it’s nice to have data to back it up! You can read more about American Black Ducks here.
Next, I'll walk through the map part of the web app.
The map reacts to both the species input and the month input; if either is missing, the map is blank. I also added an optional "Perform Clustering" checkbox. If a species is observed frequently in many places, there can be a lot of small points all over the map, resulting in a lot of noise. Like in the species diversity map, a clustering algorithm groups nearby points; this time, however, I used dbscan, since it does not require the number of clusters to be pre-defined. I'll use the American Robin in January, both clustered and unclustered, as an example.
# map data
filtered_map_data <- bird_data %>%
filter(common_name == "American Robin") %>%
filter(month == "01")
On the app, the user has the option to cluster the data or not. I will first show an unclustered map.
filtered_data <- filtered_map_data %>%
  group_by(latitude, longitude, locality) %>%
  summarise(total_counts = sum(observation_count, na.rm = TRUE)) # "heard, not seen" records have NA counts
Again, I'll define a color palette, format a tooltip, and create the map. This time I also defined a scaling factor so the circle markers aren't too big.
# Define a color palette
pal <- colorNumeric(
  palette = "Blues",
  domain = filtered_data$total_counts
)
# Calculate the maximum observation count
max_count <- max(filtered_data$total_counts, na.rm = TRUE)
# Define the scaling factor for bubble sizes
scaling_factor <- 10 / max_count
# Prepare the text for tooltips
mytext <- paste(
  "Species: ", "American Robin", "<br/>",
  "Location: ", filtered_data$locality, "<br/>",
  "Number of Observations: ", filtered_data$total_counts, "<br/>",
  sep = "") %>%
  lapply(htmltools::HTML)
# create the map
leaflet(filtered_data) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~longitude,
    lat = ~latitude,
    weight = 1,
    fillColor = ~pal(total_counts),
    color = "black",
    stroke = TRUE,
    radius = ~total_counts * scaling_factor, # Apply the scaling factor
    fillOpacity = 0.7,
    label = mytext)
The map is obviously very busy. This is where clustering is useful: it reduces the number of points by grouping nearby ones and removing noise, making it easier to see the most important areas.
Below, I walk through clustering using dbscan. First, I define ranges for eps and minPts, the two parameters controlling the search distance and the minimum number of nearby points needed to form a cluster. (Note that eps is in degrees here; 0.0001 degrees of latitude is roughly 11 meters.)
# filtered data for the map
filtered_map_data <- bird_data %>%
  filter(common_name == "American Robin") %>%
  filter(month == "01") %>%
  mutate(observation_count = replace_na(observation_count, 1)) # Replace "heard, not seen" NAs with 1
# Define the range of eps and minPts values to explore
eps_range <- seq(0.0001, 0.01, by = 0.0005)
minPts_range <- seq(2, 30, by = 1)
# Variables to store the optimal parameters and number of clusters
optimal_eps <- 0
optimal_minPts <- 0
optimal_num_clusters <- Inf # Initialize above the desired range so any valid result replaces it
The loop below runs dbscan over all parameter combinations and keeps the combination that yields the fewest clusters while still falling between 3 and 50 clusters. On the app, if the data is too sparse (too few observations), it will throw a warning and won't cluster.
# Iterate through different parameter combinations
for (eps in eps_range) {
  for (minPts in minPts_range) {
    # Perform DBSCAN clustering
    dbscan_result <- dbscan(filtered_map_data[, c("longitude", "latitude")], eps = eps, minPts = minPts)
    # Get the number of clusters (cluster 0 is noise, so this is the highest cluster ID)
    num_clusters <- max(dbscan_result$cluster)
    # Keep these parameters if this is the fewest clusters seen within the desired range
    if (num_clusters >= 3 && num_clusters <= 50 && num_clusters < optimal_num_clusters) {
      optimal_eps <- eps
      optimal_minPts <- minPts
      optimal_num_clusters <- num_clusters
    }
  }
}
# Print the optimal parameters and the resulting number of clusters
cat("Eps:", optimal_eps, "MinPts:", optimal_minPts, "Num Clusters:", optimal_num_clusters, "\n")
## Eps: 1e-04 MinPts: 28 Num Clusters: 23
These are the optimal values determined by the loop. They are then used to run dbscan one more time, and the results are joined back to the original filtered data.
# Perform DBSCAN clustering
dbscan_result <- dbscan(filtered_map_data[, c("longitude", "latitude")], eps = optimal_eps, minPts = optimal_minPts)
# Add cluster ID to the original data
filtered_map_data$cluster_id <- dbscan_result$cluster
Now the clusters are used to group the data for display on the map.
# Aggregate data by cluster, excluding cluster_id = 0, which is noise
clustered_data <- filtered_map_data %>%
  filter(cluster_id != 0) %>%
  add_count(cluster_id, locality) %>%
  group_by(cluster_id) %>%
  mutate(Majority = locality[n == max(n)][1]) %>%
  select(-n) %>%
  summarise(
    total_counts = sum(observation_count),
    latitude = mean(latitude),
    longitude = mean(longitude),
    locality = first(Majority)
  ) %>%
  ungroup()
# Calculate the maximum observation count
max_count <- max(clustered_data$total_counts, na.rm = TRUE)
# Define a color palette
pal <- colorNumeric(
  palette = "Blues",
  domain = clustered_data$total_counts
)
# Define the scaling factor for bubble sizes
scaling_factor <- 10 / max_count
# Prepare the text for tooltips
mytext <- paste(
  "Species: ", "American Robin", "<br/>",
  "Location: ", clustered_data$locality, "<br/>",
  "Number of Observations: ", clustered_data$total_counts, "<br/>",
  sep = "") %>%
  lapply(htmltools::HTML)
# Create the map
leaflet(clustered_data) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~longitude,
    lat = ~latitude,
    weight = 1,
    fillColor = ~pal(total_counts),
    color = "black",
    stroke = TRUE,
    radius = ~total_counts * scaling_factor,
    fillOpacity = 0.7,
    label = mytext
  )
dbscan results in far fewer points, but they're more meaningful. It is much easier to see where someone should prioritize going to find the selected bird.
This concludes the walkthrough of my web app! Please see the actual code for the web app to see how I created the interactivity. I hope you're inspired to go birding!